Abstract

By using the Jensen-Shannon divergence, genomic DNA can be divided into 
compositionally distinct domains through a standard recursive segmentation 
procedure. Each domain, while significantly different from its neighbours, 
may however share compositional similarity with one or more distant (non- 
neighbouring) domains. We thus obtain a coarse-grained description of the 
given DNA string in terms of a smaller set of distinct domain labels. This 
yields a minimal domain description of a given DNA sequence, significantly 
reducing its organizational complexity. This procedure gives a new means 
of evaluating genomic complexity as one examines organisms ranging from 
bacteria to human. The mosaic organization of DNA sequences could have 
originated from the insertion of fragments of one genome (the parasite) inside 
another (the host), and we present numerical experiments that are suggestive 
of this scenario. 
I. INTRODUCTION



One of the major goals in DNA sequence analysis is to gain an understanding of the
overall organization of the genome. Beyond identifying the manifestly functional regions
such as genes, promoters, repeats, etc., it has also been of interest to analyse the properties
of the DNA string itself. One set of studies has been directed towards examining the nature
of correlations between the bases. There is some evidence for long-range correlations which
give rise to 1/f spectra in genomic DNA; this feature has been attributed to the
presence of complex heterogeneities in nucleotide sequences. These result in hierarchical
patterns in DNA, the mosaic or 'domain within domain' picture. This structure is most
conveniently explored through segmentation analysis based on information theoretic
measures [3], [7], although other schemes to uncover the correlation structure over long scales,
such as detrended fluctuation analysis of DNA walks or wavelet transform techniques,
have also been applied. There have been some attempts to decode the biological implications
of such complexity [11], but these are incompletely understood as of now. On shorter
length scales there is a prominent 3-base correlation in coding regions of DNA; this offers
a means of locating and identifying genes [12]. There are other short-range correlations as
well [13], [14], corresponding to structural constraints on the DNA double helix.

Segmentation analysis is a powerful means of examining the large-scale organization of
DNA sequences [15], [17]. The most commonly used procedure is based on
maximization of the Jensen-Shannon (J-S) divergence, through which a given DNA string
is recursively separated into compositionally homogeneous segments called domains (or
patches). This results in a coarse-grained description of the DNA string as a sequence
of distinct domains. The criterion for continuing the segmentation process is based on
statistical significance (this is equivalent to hypothesis testing) or, alternatively, on a
model selection framework based on the Bayesian information criterion. This criterion
can be extended and used to detect isochores, CpG islands, the origin and terminus of
replication in bacterial genomes, complex repeats in telomere sequences, etc. Segmentation
using a 12-symbol alphabet derived from codon usage has been shown recently to delineate
the border between coding and noncoding regions in a DNA sequence [6].

In the present work, we analyse the segmentation structure of genomic DNA for a class 
of genomes ranging in (evolutionary) complexity from bacteria to human. Our motivation 
is to understand the complexity of genome organization in terms of the domains obtained. 
We further aim to correlate the domain picture with evolutionary biological processes.

By construction a given domain is heterogeneous with respect to its neighbours, but it
may nevertheless be compositionally similar to other domains. Based on this premise, we
attempt to draw a larger domain picture by obtaining 'domain sets'. These consist of a set of
domains which are homogeneous when concatenated. A domain set may thus be interpreted
as a larger homogeneous sequence, parts of which are scattered nonuniformly in a genomic
sequence. The number of domain sets constructed in this way is found to be much smaller than
the number of domains obtained upon segmentation. We propose here an optimal procedure,
starting from the domains found from one of the above segmentation methods, and building
up a domain set by adding together all its components. We then use standard complexity
measures to show that this gives a superior model inasmuch as the complexity is reduced.

This paper is organised as follows. In the next section, we briefly review the segmentation 
methods based on the J-S divergence. Section III contains our main results. We first segment 
a given genome to reveal the primary domain structure that derives from the J-S divergence. 
We then show how the domain sets are constructed, and analyse the attendant decrease in 
complexity. In Section IV, we speculate that such domain organization occurred during
genomic evolution when there was lateral gene and/or DNA transfer between species. To 
that end, we present the results of numerical experiments based on a host-parasite model, 
where we artificially insert fragments of one genome inside another, and demonstrate that 
this process can be uncovered via segmentation. Section V concludes the paper with a 
summary and discussion of our results. 
II. SEGMENTATION METHODS



In this section we briefly review the segmentation methodology that is used here in order
to fragment a genome into homogeneous domains. Consider a sequence S as a concatenation
of two subsequences $S^{(1)}$ and $S^{(2)}$. The Jensen-Shannon divergence of the subsequences
is

$$D(F^{(1)}, F^{(2)}) = H(\pi^{(1)} F^{(1)} + \pi^{(2)} F^{(2)}) - [\pi^{(1)} H(F^{(1)}) + \pi^{(2)} H(F^{(2)})], \qquad (1)$$

where $F^{(i)} = \{f_1^{(i)}, f_2^{(i)}, \ldots, f_k^{(i)}\}$, $i = 1, 2$ are the relative frequency vectors, and
$\pi^{(1)}$ and $\pi^{(2)}$ their weights. In Eq. (1), H is the Shannon entropy (in units of bits)

$$H(F) = -\sum_{i=1}^{k} f_i \log_2 f_i, \qquad (2)$$

although, as can be appreciated, a variety of other functions of the $f_i$'s can also be used as
a criterion for estimating the divergence of two sequences.

The algorithm proposed by Bernaola-Galvan et al. proceeds as follows. A sequence
is segmented in two domains such that the J-S divergence D is maximum over all possible 
partitions. Each resulting domain is then further segmented recursively. 
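
The single segmentation step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the alphabet is assumed to be exactly A, T, C, G, and the weights in Eq. (1) are assumed to be the length fractions of the two subsequences.

```python
import math
from collections import Counter

ALPHABET = "ATCG"  # assumed 4-letter alphabet; other symbols are not handled


def entropy(counts, total):
    """Shannon entropy in bits, Eq. (2), computed from raw symbol counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def js_divergence(left, right):
    """Eq. (1), with weights assumed equal to the length fractions n_i / N."""
    n1, n2 = len(left), len(right)
    c1, c2 = Counter(left), Counter(right)
    h_all = entropy([c1[a] + c2[a] for a in ALPHABET], n1 + n2)
    h1 = entropy([c1[a] for a in ALPHABET], n1)
    h2 = entropy([c2[a] for a in ALPHABET], n2)
    return h_all - (n1 * h1 + n2 * h2) / (n1 + n2)


def best_split(seq):
    """Scan every cut point and return (position, D_max)."""
    n = len(seq)
    total = Counter(seq)
    h_all = entropy([total[a] for a in ALPHABET], n)
    left = dict.fromkeys(ALPHABET, 0)
    best_pos, best_d = 0, 0.0
    for i in range(1, n):
        left[seq[i - 1]] += 1  # move one symbol from the right part to the left
        h_left = entropy(left.values(), i)
        h_right = entropy([total[a] - left[a] for a in ALPHABET], n - i)
        d = h_all - (i * h_left + (n - i) * h_right) / n
        if d > best_d:
            best_pos, best_d = i, d
    return best_pos, best_d
```

Keeping running counts makes the scan linear in the sequence length; whether the cut is kept is a separate decision, which is the subject of the stopping criteria discussed next.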

The main issue with regard to continual segmentation is that unless the significance of 
a given segmentation step is properly assessed, it is possible to arrive at segments which 
have no great significance. This question is also related to a second issue, namely when one 
should stop the recursion. Since we consider finite DNA sequences, it is again possible to 
keep segmenting until the segments are very short. Both these questions can be answered 
through one of two possible approaches which we now describe. 



A. Hypothesis testing framework 

The statistical significance of the segmentation is determined by computing the maximum
value of the J-S divergence for the two potential subsegments, $D_{\max}$, and estimating the
probability of getting this value or less in a random sequence. This defines the significance
level, s(x), as

$$s(x) = \mathrm{Prob}\{D_{\max} \le x\}. \qquad (3)$$
The probability distribution of $D_{\max}$ has an analytic approximation, and

$$s(x) = [F_\nu(\beta \cdot 2N \ln 2 \cdot x)]^{N_{\mathrm{eff}}}, \qquad (4)$$

where $F_\nu$ is the chi-square distribution function with $\nu$ degrees of freedom, N is the sequence
length, $\beta$ is a scale factor which is essentially independent of N and k, and for each k,
$N_{\mathrm{eff}} = a \ln N + b$. The values of $\beta$ and $N_{\mathrm{eff}}$ (and thus the constants a and b) are found
from Monte Carlo simulations by fitting the empirical distributions to the above expression.

Within the hypothesis testing framework, then, the segmentation is allowed if and only if
s(x) is greater than a preset level of statistical significance. It is possible to segment a given
sequence initially at a (usually very high) significance level, and then to segment the resulting
domains further at lower levels of significance to detect the inner structure or other patterns [15].
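
As an illustration of how Eq. (4) might be evaluated, the sketch below uses the closed form of the chi-square distribution function for three degrees of freedom. The choice $\nu = 3$ (taken here as k - 1 for a 4-letter alphabet) is an assumption, the function names are hypothetical, and the fitted constants beta, a and b are deliberately left as arguments because their values are not given in the text.

```python
import math


def chi2_cdf_3dof(x):
    """Chi-square distribution function for 3 degrees of freedom (closed form)."""
    if x <= 0:
        return 0.0
    return math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def significance(d_max, n, beta, a, b):
    """Eq. (4): s(x) = [F_nu(beta * 2 N ln2 * x)]^N_eff, with N_eff = a ln N + b.

    d_max is the maximum J-S divergence in bits and n the sequence length.
    beta, a and b must come from a Monte Carlo fit; none are assumed here.
    """
    n_eff = a * math.log(n) + b
    return chi2_cdf_3dof(beta * 2 * n * math.log(2) * d_max) ** n_eff
```

A candidate cut would then be accepted when `significance(...)` exceeds the preset threshold.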



B. Model selection framework 

A different criterion can be devised for stopping the recursive segmentation within the
so-called model selection framework. This is based on the Bayesian information criterion,
denoted B below,

$$B = -2\log(L) + \log(N)\,K + O(1) + O(1/\sqrt{N}) + O(1/N), \qquad (5)$$

where L is the maximum likelihood of the model, N is the sample size and K is the number
of parameters in the model.

A potential segmentation based on the J-S divergence D is deemed acceptable if B is
reduced after segmentation. From the above equation, this condition is

$$2ND > (K_2 - K_1)\log(N), \qquad (6)$$

where $K_1$ and $K_2$ are the number of free parameters of the models before and after the
segmentation. This is the lower bound of the significance level; an upper bound can be
preset by using a measure of segmentation strength,

$$s = \frac{2ND - (K_2 - K_1)\log(N)}{(K_2 - K_1)\log(N)}. \qquad (7)$$

Eq. (6) is equivalent to the condition s > 0.
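
The acceptance test and the segmentation strength are simple enough to write down directly. The following is a minimal sketch with hypothetical function names; it assumes that D and log(N) are measured in the same logarithm base (base 2 here, since the entropy above is in bits), which leaves the inequality unchanged.

```python
import math


def bic_accepts(n, d_bits, k1, k2):
    """Eq. (6): accept the cut if 2 N D exceeds the BIC penalty."""
    return 2 * n * d_bits > (k2 - k1) * math.log2(n)


def segmentation_strength(n, d_bits, k1, k2):
    """Eq. (7): relative excess of 2 N D over the BIC penalty."""
    penalty = (k2 - k1) * math.log2(n)
    return (2 * n * d_bits - penalty) / penalty
```

By construction `bic_accepts` returns True exactly when `segmentation_strength` is positive, so a stricter criterion amounts to requiring the strength to exceed some positive threshold.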

III. APPLICATIONS AND ANALYSIS 

In the present work we consider DNA sequences as strings in a 4-letter alphabet
(A, T, C, G). In the model selection framework discussed above, therefore, the relevant
parameters are $K_1 = 3$ (since only 3 of the 4 nucleotides are independent) and $K_2 = 7$ (the
3 free parameters from each of the two subsegments, and in addition, the partition point
which is another independent parameter) [7]. The importance of this segmentation approach
in detecting some of the structural and functional units in DNA sequences has been
demonstrated recently. The results that follow have been obtained by the application of this
approach.
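
Putting the pieces together, recursive segmentation at the lower bound of Eq. (6) with $K_1 = 3$ and $K_2 = 7$ might look as follows. This is a sketch under stated assumptions rather than the code used for the results: function names are hypothetical, weights are length fractions, and logarithms are taken in base 2 on both sides of Eq. (6).

```python
import math

ALPHABET = "ATCG"
K1, K2 = 3, 7  # free parameters before and after a split, as in the text


def entropy(counts, total):
    """Shannon entropy in bits from raw symbol counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def best_split(seq):
    """Return (position, D_max) over all binary partitions of seq."""
    n = len(seq)
    total = {a: seq.count(a) for a in ALPHABET}
    h_all = entropy(total.values(), n)
    left = dict.fromkeys(ALPHABET, 0)
    best_pos, best_d = 0, 0.0
    for i in range(1, n):
        left[seq[i - 1]] += 1
        h_left = entropy(left.values(), i)
        h_right = entropy([total[a] - left[a] for a in ALPHABET], n - i)
        d = h_all - (i * h_left + (n - i) * h_right) / n
        if d > best_d:
            best_pos, best_d = i, d
    return best_pos, best_d


def segment(seq, offset=0):
    """Return the sorted list of accepted domain boundaries in seq."""
    n = len(seq)
    if n < 2:
        return []
    pos, d = best_split(seq)
    if pos == 0 or 2 * n * d <= (K2 - K1) * math.log2(n):
        return []  # Eq. (6) fails: keep this stretch as a single domain
    return segment(seq[:pos], offset) + [offset + pos] + segment(seq[pos:], offset + pos)
```

For genome-sized inputs the recursion and the string slicing would need to be replaced by index ranges over precomputed cumulative counts; the version above favours readability.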

A. Labeling the domains 

The complete genome of the bacterium Ureaplasma urealyticum (751719 bp) and a contig
of human chromosome 22 (gi|10879979|ref|NT_011521.1|, 767357 bp) were segmented
at the lower bound of the stopping criterion, namely Eq. (6). The number of segments
obtained by this procedure is 86 for the bacterium and 248 for the human chromosome 22
contig. Labeling each of these segments by a unique symbol gives a coarse-grained view of
the entire sequence, say $S_1 \cdot S_2 \cdots S_N$.

While each segment $S_k$ is heterogeneous with respect to its neighbours, $S_{k\pm1}$, it need
not be compositionally distinct from a non-neighbouring segment, $S_j$. Therefore, we now
examine the inter se heterogeneity of all segments with respect to each other. Segments
$S_k$ and $S_j$ are concatenated, and if this 'super segment' cannot be segmented by the same
criterion, then both $S_k$ and $S_j$ are assigned the same domain symbol. This is done recursively
and exhaustively, so that within the model selection framework of segmentation, all domains
that cannot be distinguished from one another are assigned the same symbol. This gives a
reduced and further coarse-grained view of the domain structure of a DNA sequence.

To ensure that the above procedure is as complete and self-consistent as possible, we
examine each segment $S_k$ by concatenating it with $S_j$ and all preceding distinct segments
that share the same domain symbol as $S_j$, and examine whether this larger sequence can be
segmented. Explicitly, if segments $S_i$ and $S_j$ have the same symbol (following the procedure
given above) we examine the supersegment $S_i \cdot S_j \cdot S_k$ to determine whether segment $S_k$
should share the same domain symbol or not. It is further required to consider all
possible subsets ($S_i \cdot S_k$, $S_j \cdot S_k$, etc.) to ensure that all segments that are deemed to share a
given domain symbol do indeed belong to one class, namely that such superdomains do not
undergo further segmentation.
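
A simplified version of this labeling step is sketched below. It is a greedy approximation and not the exhaustive procedure described above: a segment joins an existing class only if neither its concatenation with the whole class nor its concatenation with any single member can be segmented, but larger subsets are not checked. Function names are hypothetical, and Eq. (6) is applied with base-2 logarithms and $K_1 = 3$, $K_2 = 7$.

```python
import math

ALPHABET = "ATCG"


def entropy(counts, total):
    """Shannon entropy in bits from raw symbol counts."""
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def splittable(seq, k1=3, k2=7):
    """True if the best binary split of seq satisfies Eq. (6)."""
    n = len(seq)
    total = {a: seq.count(a) for a in ALPHABET}
    h_all = entropy(total.values(), n)
    left = dict.fromkeys(ALPHABET, 0)
    d_max = 0.0
    for i in range(1, n):
        left[seq[i - 1]] += 1
        h_left = entropy(left.values(), i)
        h_right = entropy([total[a] - left[a] for a in ALPHABET], n - i)
        d_max = max(d_max, h_all - (i * h_left + (n - i) * h_right) / n)
    return 2 * n * d_max > (k2 - k1) * math.log2(n)


def label_domains(segments):
    """Assign a class index to each (non-empty) segment, in genomic order."""
    classes = []  # each class is a list of segment indices
    labels = []
    for k, seg in enumerate(segments):
        for c, members in enumerate(classes):
            pairwise_ok = all(not splittable(segments[j] + seg) for j in members)
            joint = "".join(segments[j] for j in members) + seg
            if pairwise_ok and not splittable(joint):
                members.append(k)
                labels.append(c)
                break
        else:  # no existing class accepts this segment
            classes.append([k])
            labels.append(len(classes) - 1)
    return labels
```

The returned list plays the role of the domain symbols: equal entries mark segments that belong to the same domain set.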

Following the above, the 86 domains obtained from the segmentation of U. urealyticum
are reduced to a total of 17 distinct domain types:

[Listing of the 86 segments in genomic order, each labelled by one of the 17 domain
types $S_1$ to $S_{17}$; the individual labels are not legible here.]

The 248 segments of human chromosome 22 also undergo simplification, to a total of 53 
distinct domain types: 
This gives a maximally coarse-grained view of the DNA sequence, in terms of "domain
sets": these are the elements of a given domain type which may be scattered over the
entire genome. Examples above are domains like $S_1$ in the bacterium or $S_{13}$ in human, which
are widely dispersed (these are underlined for visual clarity above), suggesting that these
fragments possibly had a common origin, or that they were inserted at the same time during
evolution. Expansion-modification [25] and insertion-deletion [26] are thought to play a
major role in evolution: the former ensures duplication accompanied by point mutations
in genomes and the latter results in insertion of a part of a chromosome inside a nucleotide
sequence or deletion of base pairs from a nucleotide sequence. An initial homogeneous
sequence may thus become heterogeneous through insertions and deletions that continue
throughout evolution. Insertions may cause the pieces of a homogeneous sequence to spread.



B. Insertion deletion and heterogeneity 

The process of insertion-deletion [26] has played an important role in increasing the
complexity of genomes. Motivated by the simplification of domain description as above,
we perform the following numerical experiment in order to examine the increase in
complexity by such processes. Fragments of the U. urealyticum bacterial sequence of total
length 80 kbp are inserted at N random positions in the human chromosome 22 contig
(gi|10879979|ref|NT_011521.1|). The heterogeneity will naturally increase because of such
insertions.

Prior to the insertion of bacterial fragments, the total number of domains in the human
chromosome 22 contig is 248; after inserting the fragments at random positions, in a typical
realization, the number of segments obtained is 375. The results of such experiments can be
quantified through the sequence compositional complexity [18], [27], denoted $\mathcal{S}$,

$$\mathcal{S} = H(S) - \sum_{i=1}^{n} \frac{n_i}{N} H(S_i) = \sum_{i=1}^{n} \frac{n_i}{N} [H(S) - H(S_i)], \qquad (8)$$

where S denotes the whole sequence of length N, n is the number of domains, and $S_i$ is the
ith domain of length $n_i$. This measure, which is independent of the length of the sequence,
quantifies the difference or dispersion among the compositions of the domains. The higher
the $\mathcal{S}$, the more heterogeneous the DNA sequence.
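
For concreteness, the sequence compositional complexity can be computed from a list of domains as in the sketch below. The function names are hypothetical and the entropy is taken in bits, consistent with Eq. (2).

```python
import math
from collections import Counter


def entropy(seq):
    """Shannon entropy in bits of the symbol composition of seq."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def compositional_complexity(domains):
    """Entropy of the whole sequence minus the length-weighted domain entropies."""
    whole = "".join(domains)
    n = len(whole)
    return entropy(whole) - sum(len(d) / n * entropy(d) for d in domains)
```

Two domains of identical composition give zero, whereas two domains with no symbols in common and equal lengths give one bit, so the measure tracks how strongly the domain compositions differ.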

When fragments of very different composition are inserted into a given DNA sequence,
the complexity will necessarily increase. We compute $\Delta\mathcal{S} = \mathcal{S}' - \mathcal{S}$ for domains obtained
after and before the insertion for the example as above and also for a number of genomes.
In all cases $\Delta\mathcal{S} > 0$: the compositional complexity increases after insertion. If deletion is
also introduced, say by removing a fragment of random length from a random position (the
range of lengths being deleted is kept the same as that of the 'inserts'), in general
$\Delta\mathcal{S}$ increases further.

C. Measuring the complexity 

We quantify the simplification of the domain description of the two representative genomes
by considering a complexity measure within the model selection framework, namely the
Bayesian information criterion (B). Within standard statistical analysis, one model is
superior to another if it has a lower B. For the case of U. urealyticum,
where the segmentation procedure gives 86 domains,

$$B = -2\log(L) + 343\log(N), \qquad (9)$$

where the K = 343 parameters correspond to 86 x 3 base compositions and 85 borders. These
are reorganized into 17 domain sets, and thus

$$B' = -2\log(L') + 136\log(N) \qquad (10)$$

(136 = 17 x 3 + 85). The maximum likelihood can be expressed as

$$L(\{p_\alpha\}) = \prod_\alpha p_\alpha^{N_\alpha}, \qquad (11)$$

where $\{p_\alpha\}$ and $\{N_\alpha\}$ are the base composition parameters and the base counts, respectively,
corresponding to the alphabet $\{\alpha = A, T, G, C\}$ of a sequence. $\Delta B = B' - B$ depends on the
relative contribution of both terms; typically $L > L'$ since the first segmentation uses a
more accurate measurement of base composition. The reduction in this measure comes from
the second term, through the drastic reduction in the number of distinct domains, which
reduces the model complexity.
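
The parameter counting behind Eqs. (9) and (10) and the likelihood of Eq. (11) can be checked with a short sketch. The function names are hypothetical, natural logarithms are assumed, and $p_\alpha$ is set to the observed frequency $N_\alpha / N$ of each domain or domain set.

```python
import math
from collections import Counter


def log_likelihood(seq):
    """log of Eq. (11) with p_alpha equal to the observed frequency N_alpha / N."""
    n = len(seq)
    return sum(c * math.log(c / n) for c in Counter(seq).values())


def parameter_count(n_compositions, n_borders, free_per_composition=3):
    """Three free base frequencies per composition plus one parameter per border."""
    return free_per_composition * n_compositions + n_borders


def bic(log_l, k, n):
    """Leading terms of Eq. (5): -2 log L + K log N."""
    return -2 * log_l + k * math.log(n)
```

Here `parameter_count(86, 85)` gives 343 and `parameter_count(17, 85)` gives 136. Under the natural-logarithm assumption, the penalty term alone changes by (136 - 343) log N, which is about -2801 for N = 751719, so the likelihood term offsets part of that gain.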
For U. urealyticum and human, $\Delta B$ = -1709 and -4884 respectively, which shows that
the model representative of the domain set is better than the original one (we use the lower
bound, i.e. $\Delta B < 0$, for determining the statistical significance). As another example, we
found $\Delta B$ for Thermoplasma acidophilum (archaebacterium, 1564906 bp) and another contig
of human chromosome 22 (gi|10880022|ref|NT_011522.1|, 1528072 bp) to be -2808
and -10420 respectively. We repeated this procedure for different available genomes and
found the above results to be consistent. Note that the simplification can also be quantified
in terms of $\mathcal{S}$, and we observe $\Delta\mathcal{S} < 0$ in all cases.

IV. IN SILICO EXPERIMENTS ON DOMAIN INSERTION: A HOST-PARASITE PERSPECTIVE

It is tempting to speculate that the heterogeneity that is uncovered by the segmentation 
procedures discussed above is a reflection of the evolutionary history of the given sequence, 
and in particular, that the different domains arise from insertion processes acting at different 
evolutionary times. For instance, it is well known that the human genome contains a small
fraction of sequences of bacterial origin, which have most likely arisen from processes such
as viral insertion or lateral gene transfer.

To what extent can the segmentation process determine the exact pattern of insertions? 
Here we describe some simple experiments that are designed to explore this question. Start- 
ing with a homogeneous fragment of human DNA, we insert fragments from (a homogeneous 
segment of) bacterial genomes; this increases the heterogeneity. We then apply the segmenta- 
tion algorithm followed by the labeling procedure and compare the results with the (known) 
control. 

Experiments were done on a homogeneous domain set from the human genome, of total 
length 100139 bp. Into this, fragments from a homogeneous segment of length 17584 bp from 
the genome of U. urealyticum were inserted. In a representative case, we took 3 fragments 
(of lengths 5000, 7000 and 5584 bp respectively) and inserted them at locations 10000, 50000 
and 92000 in the human genome domain.
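
The construction of such a control sequence can be sketched as follows. The function name is hypothetical, and the insertion positions are interpreted as coordinates in the growing composite sequence, processed from left to right; this interpretation is an assumption, chosen because it reproduces the exact boundary positions quoted for this experiment.

```python
def insert_fragments(host, inserts):
    """Insert (position, fragment) pairs into host and return (sequence, boundaries).

    Positions refer to the composite sequence as it grows, and are processed in
    increasing order so that earlier boundaries are not shifted by later inserts.
    """
    seq = host
    boundaries = []
    for pos, frag in sorted(inserts, key=lambda item: item[0]):
        seq = seq[:pos] + frag + seq[pos:]
        boundaries += [pos, pos + len(frag)]
    return seq, boundaries
```

With a host of length 100139 and fragments of lengths 5000, 7000 and 5584 placed at 10000, 50000 and 92000, the known boundaries are 10000, 15000, 50000, 57000, 92000 and 97584, and the composite sequence has length 117723.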

Upon segmentation, all seven segments were identified, with the boundaries between the
bacterial and human DNA sequences determined as follows: 9984 (10000), 15000 (15000),
49751, 50060 (50000), 56968 (57000), 91636 (92000) and 97575 (97584) (the exact values
are given in brackets). There is thus one false positive, at 49751, but otherwise all the
boundaries are determined to fairly high precision. The domain sets can also be reconstructed,
and the seven segments, $S_1 S_2 S_1 S_2 S_1 S_2 S_1$, conform to two sets.
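
The comparison between detected and exact boundaries can be made explicit with a small matching routine. This is an illustrative sketch with a hypothetical function name: each exact boundary is greedily paired with the nearest unused detected boundary, and whatever remains is counted as a false positive. It assumes there are at least as many detected boundaries as exact ones.

```python
def match_boundaries(true_positions, detected):
    """Return (absolute errors per true boundary, unmatched detected boundaries)."""
    remaining = list(detected)
    errors = []
    for t in true_positions:
        nearest = min(remaining, key=lambda p: abs(p - t))
        errors.append(abs(nearest - t))
        remaining.remove(nearest)
    return errors, remaining


true_positions = [10000, 15000, 50000, 57000, 92000, 97584]
detected = [9984, 15000, 49751, 50060, 56968, 91636, 97575]
errors, false_positives = match_boundaries(true_positions, detected)
```

For the values quoted above, the errors are 16, 0, 60, 32, 364 and 9 bp, and the single unmatched boundary is the one at 49751.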

Shown in Fig. 1(a) is the insertion process for a case where fragments from two bacterial 
genomes, Ureaplasma urealyticum and Thermoplasma acidophilum are randomly inserted in 
the human genome segment. Carrying out segmentation at varying strength s gives a greater 
number of segments compared to the correct value of 13. With s = 0.2, one gets 18 segments 
(see Fig. 1(b)) which is the best reconstruction possible within the present framework. On 
obtaining domain sets, we find that up to about 85% of human and U. urealyticum genomes 
are properly identified, the errors affecting the reconstruction of T. acidophilum which is 
only 67% accurate. 

To summarize, our results from several numerical experiments show that the reconstruc- 
tion of the fragmentation process can be done to high accuracy so long as the inserted 
fragments are sufficiently long and widely separated. 

V. DISCUSSION AND SUMMARY 

Segmentation offers a novel view of the compositional heterogeneity of a DNA sequence. 
In the present work we have applied the segmentation analysis to genomic sequences from 
several organisms. 

Our main focus has been on understanding the organization and to this end we have 
applied a number of different analytical tools. Our main analysis has been directed towards 
obtaining a coarse-grained representation of DNA as a string of minimal domain labels. 
Complexity measures indicate that the reduced model in terms of domain sets is superior 
to a model where each domain is treated as independent.

Insofar as the different domains are considered, our main hypothesis is that these arise
when fragments of one (possibly homogeneous) DNA sequence get randomly inserted into
another (also possibly homogeneous) sequence. A controlled set of (numerical) experiments
gives support to this hypothesis: we are able to identify domain boundaries to high accuracy
so long as inserted domains are not very short. The accuracy could be further increased by
improving the segmentation process, for example, using 1 to 3 segmentation rather than the
binary or 1 to 2 segmentation used here: binary segmentation is only one of several possible
segmentation procedures (see Ref. [17]).

A consequence of this analysis, and one that we are currently exploring, is that different 
domains (or domain sets) in one genome can have arisen via insertion from another organism. 
Homology analysis (say by the use of standard tools such as BLAST or FASTA) can help 
to unravel the origins of the domains. Thus segmentation analysis can possibly help in 
reconstructing the evolutionary history of the genome. 
FIG. 1. (a) Representation of a DNA sequence obtained by random insertion of fragments of
two bacterial sequences, T. acidophilum (T) and U. urealyticum (U), into a human sequence (H)
(see text). (b) The domain structure as uncovered by the procedure of segmentation and labeling
(as described in the text).